General considerations

To participate correctly in the Data Triathlon competition, follow this notebook format proposed by the Platzi data team. It will help with the grading of your notebook.

Notebook sections:

*To review the competition rules, go to this blog post.

What is in the data?

The data contains information on Colombia's exports and imports from 1962 to 2017. It was extracted from the OEC: The Observatory of Economic Complexity.

Your task is to develop an exploratory analysis to find valuable insights from this data and from any other data you collect.

Questions

To develop your analysis, you can answer the following questions:

  1. Which are the top 10 countries Colombia exports its products to, and how have they evolved over time?
  2. Which are the top 10 countries Colombia imports its products from, and how have they evolved over time?
  3. Is there any product for which economic changes in supply/demand have caused a decrease in the volume of money it moves?
  4. Which economic sectors matter most in Colombia's exports, and why?

You can also answer any questions of your own.

Additional data extraction (0-20%)

To obtain information beyond what Platzi provides, you can extract it directly from the original data source. The dataset contains information extracted from the OEC: The Observatory of Economic Complexity, specifically from the legacy version, which allows files to be downloaded from a URL and through the API they expose.

For your analysis, collect more data from this and any other sources you consider necessary.

Data cleaning and transformation (0-20%)

For data cleaning you can use tools such as Pandas and numpy to clean and structure the data, handling any null or empty values that are not needed for the required analysis.
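As a minimal sketch of that kind of cleaning, the snippet below works on a tiny made-up table. The column names (`origin`, `dest`, `export_val`, `unused`) and the `"NULL"` placeholder are assumptions for illustration, not the actual layout of the competition files.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the trade records; names and values are invented.
raw = pd.DataFrame({
    "year": [2016, 2016, 2017, 2017],
    "origin": ["col", "col", "col", None],
    "dest": ["usa", "chn", "usa", "mex"],
    "export_val": ["1200.5", "NULL", "980.0", "310.0"],
    "unused": [np.nan, np.nan, np.nan, np.nan],
})

clean = (
    raw.dropna(axis="columns", how="all")   # drop columns that are entirely empty
       .dropna(subset=["origin"])           # drop rows missing a required key
       .assign(
           # parse values as numbers; unparseable placeholders become NaN
           export_val=lambda d: pd.to_numeric(d["export_val"], errors="coerce")
       )
)

print(clean)
```

Whether a missing value should be dropped, kept as NaN, or filled depends on the question being asked, so the choices above are only one reasonable option.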

First goal

In this first step, the goal is to build a full dataset with all the available fields, using the primary and foreign keys of each table to join them to the transactional datasets (the exports and imports files).

Let's fit the data together with these steps:

Labeling the transactional data sets

Add the country names field

Add the product names

Add geographical data
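The labeling steps above can be sketched as a chain of left merges on the key columns. This is only an illustration on toy tables: the column names, country codes and product codes are assumptions, and the coordinates are rough placeholders.

```python
import pandas as pd

# Toy transactional table; every identifier here is illustrative.
trades = pd.DataFrame({
    "year": [2017, 2017],
    "origin": ["col", "col"],
    "dest": ["usa", "deu"],
    "product_id": ["p01", "p02"],
    "export_val": [500.0, 900.0],
})
# Toy lookup tables keyed by country code and product code.
countries = pd.DataFrame({
    "id": ["col", "usa", "deu"],
    "name": ["Colombia", "United States", "Germany"],
    "lat": [4.6, 37.1, 51.2],
    "lon": [-74.3, -95.7, 10.5],
})
products = pd.DataFrame({
    "product_id": ["p01", "p02"],
    "product_name": ["Coffee", "Crude Petroleum"],
})

# Rename the lookup columns so each merge adds clearly labeled fields.
origin_info = countries[["id", "name"]].rename(
    columns={"id": "origin", "name": "origin_name"}
)
dest_info = countries.rename(
    columns={"id": "dest", "name": "dest_name", "lat": "dest_lat", "lon": "dest_lon"}
)

full = (
    trades.merge(origin_info, on="origin", how="left")    # country names (origin)
          .merge(dest_info, on="dest", how="left")        # names + geography (destination)
          .merge(products, on="product_id", how="left")   # product names
)

print(full)
```

A left merge keeps every transaction even when a key has no match in the lookup table, which is what later shows up as missing names.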

Results

Finally, we can check at a glance the completeness level of all columns.

In this graphic, the horizontal axis shows all the columns in the current state of the dataset and the vertical axis shows the rows. Each cell is drawn as a black line if it is filled or a white line if it is a missing value.

We can see a few missing values in the export and import columns and in the origin and destination names. We will measure them later.
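The same completeness check can be expressed numerically. The sketch below uses a small invented table with assumed column names; it measures the share of filled cells per column, which is the quantity the black-and-white matrix shows visually.

```python
import numpy as np
import pandas as pd

# Toy table with a few gaps; names and values are invented.
df = pd.DataFrame({
    "year": [2015, 2016, 2017, 2017],
    "export_val": [10.0, np.nan, 30.0, 40.0],
    "import_val": [np.nan, np.nan, 5.0, 8.0],
    "dest_name": ["United States", "China", None, "Mexico"],
})

# notna() gives a boolean table; the column mean is the filled fraction.
completeness = df.notna().mean().mul(100).round(1)
missing_pct = 100 - completeness

print(missing_pct)
```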

Exploratory analysis (analytics, data visualization and storytelling) (0-60%)

Exploratory analysis is a fundamental part of answering the questions proposed by Platzi's team of data scientists.

Basic Info

First, let's explore the basic info about the results of the previous section.

From the basic info we can see categorical data such as the type of transaction, the names of the origin and destination countries, and the labels.

We can also see numerical data, such as the import and export transaction values, latitude and longitude, and the year.

Let's explore a basic description of each kind of data.

As stated earlier, the data runs from 1962 to 2017. Later we will explore the monetary and geographical values in more detail.

We have five more destination countries than origin countries, two types of transaction, and 936 different products.

As we knew from our previous results, there is very little null data: just 8%, 6% and close to 0% for the export values, import values and names respectively. These magnitudes should not affect the later analysis.
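A basic description like the one above can be sketched as follows. The table is a toy example with assumed column names, so the counts it prints are not the real ones.

```python
import pandas as pd

# Toy table; names and values are invented.
df = pd.DataFrame({
    "year": [1962, 1990, 2017, 2017],
    "transaction": ["export", "import", "export", "import"],
    "origin_name": ["Colombia", "Germany", "Colombia", "China"],
    "dest_name": ["United States", "Colombia", "Panama", "Colombia"],
    "product_name": ["Coffee", "Cars", "Crude Petroleum", "Cars"],
})

# Time span covered by the records.
year_range = (df["year"].min(), df["year"].max())

# Number of distinct values in each categorical column.
categorical = ["transaction", "origin_name", "dest_name", "product_name"]
unique_counts = df[categorical].nunique()

print(year_range)
print(unique_counts)
```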

Numerical data

Let's look at the distribution of the numerical data.

As we can see, there are more transactions from 2017 than from the other years, so we expect to find more transaction value in that year.

In this graphic we can see the evolution of the export and import values across all the export records. The two tend to be very similar, with the highest value in 2015 and the lowest values in the sixties.

The highest value is in 2015, even though there are more transactions in 2017, as we saw before.
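The difference between "most records" and "highest value" can be checked with one grouped aggregation. The numbers below are invented and the column names are assumptions; they only illustrate how the two measures can point to different years.

```python
import pandas as pd

# Toy records: few large transactions in 2015, many small ones in 2017.
df = pd.DataFrame({
    "year": [2015, 2015, 2017, 2017, 2017],
    "export_val": [900.0, 800.0, 300.0, 200.0, 100.0],
})

# Count of records and total value per year.
per_year = df.groupby("year")["export_val"].agg(records="count", total_value="sum")

busiest_year = per_year["records"].idxmax()
highest_value_year = per_year["total_value"].idxmax()

print(per_year)
```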

In this graphic we can see the evolution of the export and import values across all the export records. The two tend to be very similar, with the highest value in 2015 and the lowest values in the sixties.

It is very similar to the previous one.

Categorical data

Let's visualize the counts of the categorical data.

The export data has 0.7% more records.

Coffee is the product with the most transactions.

The United States and Germany are the countries with the most transactions in imports and exports.

Colombian Transactions

Now we are going to explore the export and import values against all the categorical variables, this time using only the values of Colombian transactions.

As we can see, Colombia has historically had more exports than imports, which is a good sign for its balance.

As we can see, Colombia has historically imported cars and many other unclassified items.

As we can see, the typical exported products are there: Crude Petroleum, Coffee, coal, bananas and flowers.

The United States, China and Mexico are the countries with the highest import value over the whole period.

The United States and Venezuela are the countries with the highest export value over the whole period.
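A ranking like this one can be sketched by summing the value per partner and keeping the largest totals. The table, its column names and its values are invented for illustration.

```python
import pandas as pd

# Toy export records; values are invented.
exports = pd.DataFrame({
    "dest_name": ["United States", "Venezuela", "United States", "Panama", "Venezuela"],
    "export_val": [500.0, 300.0, 400.0, 100.0, 50.0],
})

top_n = 2  # the competition question asks for the top 10
top_destinations = (
    exports.groupby("dest_name")["export_val"]
           .sum()             # total value per destination over all years
           .nlargest(top_n)   # keep the highest totals, sorted descending
)

print(top_destinations)
```

The same pattern applies to imports by grouping on the origin column instead.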

Geographical data

Let's visualize the geographical data.

We can see where the exports go all around the world: the redder the point, the higher the value of the transactions.

We can see where the imports come from all around the world: the redder the point, the higher the value of the transactions.

Evolution

Let's see how the value of the top products evolves along the timeline.

We can see that in the eighties coffee was the top export product, but in the 21st century crude petroleum and coal lead, ahead of coffee.

We can see that in the last two years many items appeared that do not have a classification yet, but products like Cars and Large Aircraft have been imported at a higher value than the other products.

We can see that since the eighties exports to the World have been rising roughly linearly, followed by the USA and China in the last 10 years.

Here we can see that the records of World imports end in 1983, that the USA is the top exporter to Colombia, and that in the last ten years China and Mexico have been leading the import values into Colombia.
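One way to build such an evolution table is to pick the top products by total value and pivot them into one column per product. The figures below are invented and the column names are assumptions; only the shape of the computation matters.

```python
import pandas as pd

# Toy export records for two years; values are invented.
exports = pd.DataFrame({
    "year": [1985, 1985, 1985, 2015, 2015, 2015],
    "product_name": ["Coffee", "Crude Petroleum", "Bananas",
                     "Coffee", "Crude Petroleum", "Bananas"],
    "export_val": [80.0, 20.0, 5.0, 30.0, 150.0, 8.0],
})

# Top products by total value over the whole period.
top_products = exports.groupby("product_name")["export_val"].sum().nlargest(2).index

# One row per year, one column per top product.
evolution = (
    exports[exports["product_name"].isin(top_products)]
    .pivot_table(index="year", columns="product_name",
                 values="export_val", aggfunc="sum", fill_value=0)
)

# Product with the highest value in each year.
leader_by_year = evolution.idxmax(axis="columns")

print(evolution)
```

Calling `evolution.plot()` on a table of this shape would draw one line per product, assuming matplotlib is installed.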

Product Country Analysis

Let's see which products from which countries have the biggest impact on Colombia's trade balance.

In this graphic we can see exactly which products we have historically sold to which countries with the highest transaction values: Crude Petroleum to the USA and the World, and Coffee to the World.

In the last ten years we have Crude Petroleum sold to China and Panama, and Gold to the USA.

The imports have varied a lot throughout history: we have bought Lubricating Petroleum Oils from the United States, Large Aircraft from France, Cars from Mexico, and TV and Radio Transmitters from China.
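A product-country ranking of this kind can be sketched by grouping on both columns at once. The data and column names below are invented for illustration.

```python
import pandas as pd

# Toy export records; values are invented.
exports = pd.DataFrame({
    "product_name": ["Crude Petroleum", "Crude Petroleum", "Coffee", "Gold"],
    "dest_name": ["United States", "United States", "World", "United States"],
    "export_val": [300.0, 200.0, 250.0, 100.0],
})

# Total value per (product, destination) pair, largest first.
pairs = (
    exports.groupby(["product_name", "dest_name"])["export_val"]
           .sum()
           .sort_values(ascending=False)
           .head(2)
)

print(pairs)
```

Restricting the input to recent years before grouping gives the "last ten years" view.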

The Colombian Trade balance

The trade balance is the difference that exists between the total exports and imports of a country
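As a small worked example of that definition, the balance is exports minus imports for each year, and a negative balance is a deficit. The yearly totals below are invented, not Colombia's real figures.

```python
import pandas as pd

# Invented yearly totals, used only to show the arithmetic.
yearly = pd.DataFrame({
    "year": [2014, 2015, 2016, 2017],
    "exports": [100.0, 120.0, 90.0, 80.0],
    "imports": [80.0, 100.0, 95.0, 100.0],
}).set_index("year")

# Trade balance = total exports - total imports.
yearly["balance"] = yearly["exports"] - yearly["imports"]

# Years in which imports exceeded exports.
deficit_years = yearly.index[yearly["balance"] < 0].tolist()

print(yearly)
```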

From the Trade Balance Evolution graphic, we can see that Colombia has been losing money in the last two years of the series, and that Colombia had performed well in the 21st century apart from those last two years.

At the very beginning of the series, we can see that the changing value of money over time does not let us properly compare the whole history.

Model (optional)

If you need to add a model to improve your analysis, do it in this section of the notebook. Its evaluation will count as part of the exploratory analysis and it is completely optional.

I want to predict the value (Value) of a transaction given the product, country and year, so let's take the same approach as with the Colombian Transactions datasets, but with numerical data, to solve this.

This dataset will let me use the historical behavior of all transactions, and with all-numerical data we can run some models to solve this regression problem.

First, it is essential to know how the variables correlate with one another, and we can see that all the variables are well correlated.
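A correlation check of this kind can be sketched with pandas' Pearson correlation matrix. The table is a toy example with assumed column names, built so that `value` grows exactly with `year`.

```python
import pandas as pd

# Toy numerical table; values are invented.
df = pd.DataFrame({
    "year": [2013, 2014, 2015, 2016, 2017],
    "lat": [4.0, 10.0, -3.0, 8.0, 1.0],
    "value": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Pairwise Pearson correlation between all numerical columns.
corr = df.corr()

print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship.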

Linear Regression

Let's see how a linear regression performs with this data, paying attention to the p-value of each component and to the R-squared (R2) metric.

As we can see, the R-squared metric tells us that we do better than the mean model, so we could predict this value with a stronger model. However, the "Transaction" variable is not statistically significant according to its p-value, so we can try our later models without this variable.
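To make the two metrics concrete, here is a minimal ordinary least squares sketch in plain NumPy on synthetic data. The library used in the original run is not shown, so this is a stand-in: the data is generated so that `value` depends on `year` only, and the variable names are assumptions.

```python
import numpy as np

# Synthetic data: value depends on year, not on the transaction flag.
rng = np.random.default_rng(0)
n = 200
year = rng.uniform(0.0, 1.0, n)                   # rescaled year
transaction = rng.integers(0, 2, n).astype(float)  # 0 = import, 1 = export
value = 3.0 * year + rng.normal(0.0, 0.1, n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), year, transaction])
beta, *_ = np.linalg.lstsq(X, value, rcond=None)

# R-squared: 1 - residual sum of squares / total sum of squares.
# R2 > 0 means the fit beats always predicting the mean.
resid = value - X @ beta
ss_res = resid @ resid
ss_tot = ((value - value.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot

# Standard errors and t-statistics of the coefficients.
dof = n - X.shape[1]
sigma2 = ss_res / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / se

print(r2, t_stats)
```

A p-value is obtained from each t-statistic through the Student t distribution with `dof` degrees of freedom. A t-statistic close to zero corresponds to a large p-value, which is the situation described for "Transaction".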

PCA

Let's see what a PCA yields with all the variables.

We first centered all the data, so that every variable is centered at zero and can be compared.

As a result, Year alone carries nearly all the variability for predicting the series, which shows us that latitude and longitude do not matter much in this process.
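The centering-then-PCA procedure can be sketched with NumPy's SVD. The data below is synthetic: one wide-range column stands in for a variable like the year, and two narrow-range columns stand in for the others. These ranges are assumptions chosen to show the effect, not properties of the real dataset.

```python
import numpy as np

# Synthetic data: one column with a wide spread, two with a narrow spread.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.uniform(1962, 2017, n),  # wide-range variable
    rng.uniform(0, 1, n),        # narrow-range variable
    rng.uniform(0, 1, n),        # narrow-range variable
])

# Center each column at zero (no rescaling).
Xc = X - X.mean(axis=0)

# PCA via SVD: rows of vt are the principal directions.
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
explained_ratio = s**2 / np.sum(s**2)

print(explained_ratio.round(4))
print(vt[0].round(3))
```

When the data is centered but not scaled, the variable with the largest variance tends to dominate the first component. Whether that accounts for the result above depends on the actual spreads in the data; dividing each column by its standard deviation before the SVD would put all variables on an equal footing.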

Neural Network

Just out of curiosity, let's see what happens if we use this little dataset in a neural network to predict the Value of the transaction, knowing that the "Transaction" variable is not useful and that the "Year" variable holds all the variability.

First test

Let's start with very few epochs and a classical batch size.

For this first try we can see that a mean model is better, so we have to train a more complex model.

Second test

Let's start the second try, this time with 50 epochs.

If you are reading this last message, thanks a lot! I enjoyed this triathlon a lot!